
[SYCL][CUDA] Implement Intel USM extension #1241

Merged: 3 commits merged into intel:sycl from Implement_part_of_USM_for_CUDA on Mar 14, 2020

Conversation

@fwyzard fwyzard commented Mar 4, 2020

Implement the following functions in the CUDA plugin, and mark the tests for the USM features that are now supported.

Also, fix the CUDA version reported by SYCL.

Device

  • USM-related calls to piDeviceGetInfo()

Kernel

  • piextKernelSetArgPointer()

USM

  • piextUSMHostAlloc()
  • piextUSMDeviceAlloc()
  • piextUSMSharedAlloc()
  • piextUSMFree()
  • piextUSMEnqueueMemset()
  • piextUSMEnqueueMemcpy()
  • piextUSMEnqueuePrefetch()
  • piextUSMEnqueueMemAdvise() (see below)
  • piextUSMGetMemAllocInfo() (see below)

Limitations

As the Intel USM extension is still incomplete:

  • piextUSMEnqueuePrefetch() ignores the "flags" argument;
  • piextUSMEnqueueMemAdvise() does nothing.
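
For context, here is a minimal sketch of how an application exercises these new entry points through the USM API (assuming the SYCL 2020-style spellings; the exact namespaces in the tree at the time may differ):

```cpp
#include <CL/sycl.hpp>
#include <vector>

int main() {
  sycl::queue q{sycl::gpu_selector{}};

  constexpr size_t n = 1024;
  std::vector<int> host(n, 42);

  // malloc_device lowers to piextUSMDeviceAlloc in the plugin.
  int *dev = sycl::malloc_device<int>(n, q);

  // memcpy lowers to piextUSMEnqueueMemcpy.
  q.memcpy(dev, host.data(), n * sizeof(int)).wait();

  // Passing a raw USM pointer to a kernel goes through piextKernelSetArgPointer.
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { dev[i] += 1; }).wait();

  q.memcpy(host.data(), dev, n * sizeof(int)).wait();

  // free lowers to piextUSMFree.
  sycl::free(dev, q);
}
```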

@fwyzard fwyzard changed the title from "[SYCL][CUDA] Implement part of USM for cuda" to "[SYCL][CUDA] Implement part of USM" on Mar 4, 2020
@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 99c9d32 to 1af2918 on March 4, 2020 00:47
@bader bader added the cuda CUDA back-end label Mar 4, 2020
@bader bader requested a review from jbrodman March 4, 2020 09:11

bader commented Mar 4, 2020

@fwyzard, thanks a lot for working on this.
@Ruyk, @bjoernknafla, could you take a look, please?

@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 1af2918 to d0c1f96 Compare March 4, 2020 13:50

fwyzard commented Mar 4, 2020

Updated to use cuMemcpyAsync directly, instead of dispatching to cuMemcpyHtoDAsync, cuMemcpyDtoHAsync, etc.
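
For reference, with Unified Addressing the driver infers the transfer direction from the pointers themselves, so one call covers host-to-device, device-to-host, and device-to-device. A minimal sketch (the helper name is illustrative, not the plugin's actual code):

```cpp
#include <cuda.h>

// cuMemcpyAsync infers the copy direction from the two pointers under
// Unified Addressing, so there is no need to dispatch to
// cuMemcpyHtoDAsync / cuMemcpyDtoHAsync / cuMemcpyDtoDAsync by hand.
CUresult enqueue_usm_copy(void *dst, const void *src, size_t bytes,
                          CUstream stream) {
  return cuMemcpyAsync(reinterpret_cast<CUdeviceptr>(dst),
                       reinterpret_cast<CUdeviceptr>(const_cast<void *>(src)),
                       bytes, stream);
}
```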

Ruyk previously approved these changes Mar 4, 2020

@Ruyk Ruyk left a comment

Thanks for this! @jbrodman will be happy :-) A few comments here and there.
Have you enabled any of the USM lit tests to see if they pass?

if (event) {
  retImplEv = std::unique_ptr<_pi_event>(
      _pi_event::make_native(PI_COMMAND_MEMBUFFER_COPY, command_queue));

Probably not in the scope of this change, but PI_COMMAND_MEMBUFFER_COPY is used for the non-USM operations. We probably need a different PI_COMMAND_USM_COPY for USM operations, since the OpenCL function is different in the cl_intel_unified_shared_memory extension.

bjoernknafla previously approved these changes Mar 5, 2020
@fwyzard fwyzard dismissed stale reviews from bjoernknafla and Ruyk via 14ce4c4 March 5, 2020 17:57
@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch 4 times, most recently from ae88ebb to 7f60dbf on March 5, 2020 18:44

fwyzard commented Mar 5, 2020

I've implemented piextUSMSharedAlloc() and piextUSMEnqueueMemset(), and hopefully addressed most of the comments.

I still need to look into using cuMemHostAlloc() with the optional write-combine flag instead of cuMemAllocHost().
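
For reference, the trade-off looks roughly like this (a sketch; the helper name is illustrative):

```cpp
#include <cuda.h>

// cuMemAllocHost always returns cached, page-locked host memory.
// cuMemHostAlloc accepts flags: CU_MEMHOSTALLOC_WRITECOMBINED gives
// write-combined memory, which the device can read faster over PCIe,
// but which is very slow for the host to read back.
void *host_alloc(size_t bytes, bool write_combined) {
  void *ptr = nullptr;
  unsigned int flags = write_combined ? CU_MEMHOSTALLOC_WRITECOMBINED : 0;
  return (cuMemHostAlloc(&ptr, bytes, flags) == CUDA_SUCCESS) ? ptr : nullptr;
}
```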

If I missed anything, please mention it (again).

@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 7f60dbf to a1e51fc on March 5, 2020 18:51

Ruyk commented Mar 6, 2020

Still looking to see if there are more lit tests passing; the USM tests are marked as "unsupported" for the CUDA backend, so they won't report whether they are passing now.


fwyzard commented Mar 6, 2020

Still looking to see if there are more lit tests passing; the USM tests are marked as "unsupported" for the CUDA backend, so they won't report whether they are passing now.

How do I run the USM tests with the CUDA backend?

I have tried removing the // XFAIL: cuda line from all files under sycl/test/usm/, and running make check-sycl-cuda-usm in the build directory, and I got:

...
getDeviceCount cpu:PI_CUDA
Found available CPU device
getDeviceCount gpu:PI_CUDA
Found available GPU device
getDeviceCount accelerator:PI_CUDA
Could not find AOT device compiler opencl-aot
Found AOT device compiler ocloc
Could not find AOT device compiler aoc

Testing Time: 14.15s
  Expected Passes    : 24

Which is not what I expected, because

  • "Found available CPU device" makes no sense for CUDA
  • "Expected Passes : 24" (i.e. all) is too many, I would have expected same failers, since not all USM functions are implements.


fwyzard commented Mar 6, 2020

In fact, looking at the tests being built and run, it looks like make check-sycl-cuda-usm:

  • still builds with -fsycl-targets=spir64-unknown-linux-sycldevice instead of -fsycl-targets=nvptx64-unknown-linux-sycldevice
  • does not pass SYCL_BE=PI_CUDA at runtime


fwyzard commented Mar 9, 2020

@bashbaug @jbrodman

About the behaviour of piDeviceGetInfo() for PI_USM_HOST_SUPPORT, PI_USM_DEVICE_SUPPORT, PI_USM_SINGLE_SHARED_SUPPORT, PI_USM_CROSS_SHARED_SUPPORT, and PI_USM_SYSTEM_SHARED_SUPPORT: re-reading the OpenCL cl_intel_unified_shared_memory extension, the SYCL USM proposal, the Level Zero Memory documentation and the CUDA driver API for Unified Addressing, it's starting to make sense...

However, it looks like there are still some corner cases that are not very clear.
Is the API finalised?

According to Table 5 in cl_intel_unified_shared_memory:

The host memory access capabilities apply to any host allocation.

so, PI_USM_HOST_SUPPORT queries whether a CUDA device can access memory allocated on the host by cuMemAllocHost (or cuMemHostAlloc), i.e. zero-copy memory with Unified Addressing.


The device memory access capabilities apply to any device allocation associated with this device.

so PI_USM_DEVICE_SUPPORT queries whether a CUDA device can access memory allocated on the device itself with cuMemAlloc, which I would take for granted?

The table under USM Allocations in the USM proposal also mentions, for device allocations, the possibility of peer-to-peer access from other devices. How is that queried?
piDeviceGetInfo() has a single device parameter, so I wouldn't know how to check that.


The single device shared memory access capabilities apply to any shared allocation associated with this device.

so PI_USM_SINGLE_SHARED_SUPPORT queries whether the device supports managed memory, allocated with cuMemAllocManaged.


The cross-device shared memory access capabilities apply to any shared allocation associated with this device, or to any shared memory allocation on another device that also supports the same cross-device shared memory access capability.

so PI_USM_CROSS_SHARED_SUPPORT queries whether the device supports peer-to-peer access to managed memory.


The shared system memory access capabilities apply to any allocations made by a system allocator, such as malloc or new.

so PI_USM_SYSTEM_SHARED_SUPPORT queries whether the device can access host memory allocated by the system allocator (e.g. malloc() and new); for CUDA devices this is only available on Power9 machines as far as I know, so I have no way to test it.
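
Summarising, the three USM allocation kinds map onto the CUDA driver API roughly as follows (an illustrative sketch, not the plugin code itself):

```cpp
#include <cuda.h>

// Host USM: page-locked host memory, device-accessible via Unified
// Addressing (zero-copy); backs piextUSMHostAlloc.
CUresult usm_host_alloc(void **ptr, size_t size) {
  return cuMemAllocHost(ptr, size);
}

// Device USM: ordinary device memory; backs piextUSMDeviceAlloc.
CUresult usm_device_alloc(CUdeviceptr *ptr, size_t size) {
  return cuMemAlloc(ptr, size);
}

// Shared USM: managed memory that migrates between host and device;
// backs piextUSMSharedAlloc.
CUresult usm_shared_alloc(CUdeviceptr *ptr, size_t size) {
  return cuMemAllocManaged(ptr, size, CU_MEM_ATTACH_GLOBAL);
}
```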

@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch 2 times, most recently from 89fd6ab to edf470d on March 9, 2020 19:36
@romanovvlad romanovvlad assigned jbrodman and unassigned romanovvlad Mar 10, 2020
@jbrodman

@bashbaug @jbrodman

About the behaviour of piDeviceGetInfo() for PI_USM_HOST_SUPPORT, PI_USM_DEVICE_SUPPORT, PI_USM_SINGLE_SHARED_SUPPORT, PI_USM_CROSS_SHARED_SUPPORT, and PI_USM_SYSTEM_SHARED_SUPPORT: re-reading the OpenCL cl_intel_unified_shared_memory extension, the SYCL USM proposal, the Level Zero Memory documentation and the CUDA driver API for Unified Addressing, it's starting to make sense...

However, it looks like there are still some corner cases that are not very clear.
Is the API finalised?

No. It's mostly complete, but there are still things that might change as we get more experience and usage.

According to Table 5 in cl_intel_unified_shared_memory:

The host memory access capabilities apply to any host allocation.

so, PI_USM_HOST_SUPPORT queries whether a CUDA device can access memory allocated on the host by cuMemAllocHost (or cuMemHostAlloc), i.e. zero-copy memory with Unified Addressing.

Yes.

The device memory access capabilities apply to any device allocation associated with this device.

so PI_USM_DEVICE_SUPPORT queries whether a CUDA device can access memory allocated on the device itself with cuMemAlloc, which I would take for granted?

Yes.

The table under USM Allocations in the USM proposal also mentions, for device allocations, the possibility of peer-to-peer access from other devices. How is that queried?

TBD.

The single device shared memory access capabilities apply to any shared allocation associated with this device.

so PI_USM_SINGLE_SHARED_SUPPORT queries whether the device supports managed memory, allocated with cuMemAllocManaged.

Correct.

The cross-device shared memory access capabilities apply to any shared allocation associated with this device, or to any shared memory allocation on another device that also supports the same cross-device shared memory access capability.

so PI_USM_CROSS_SHARED_SUPPORT queries whether the device supports peer-to-peer access to managed memory.

Not entirely. P2P is kind of a special case. This is more about whether one shared allocation is able to migrate between different devices in a platform. P2P is really more about whether it can migrate directly without needing to go through the host.

The shared system memory access capabilities apply to any allocations made by a system allocator, such as malloc or new.

so PI_USM_SYSTEM_SHARED_SUPPORT queries whether the device can access host memory allocated by the system allocator (e.g. malloc() and new); for CUDA devices this is only available on Power9 machines as far as I know, so I have no way to test it.

Correct. I don't expect this to really be done until multiple vendors implement the same protocol for coherent device attach.

jbrodman previously approved these changes Mar 10, 2020

@jbrodman jbrodman left a comment

LGTM. Thanks!


fwyzard commented Mar 10, 2020

@jbrodman thank you very much for the clarifications.

PI_USM_SINGLE_SHARED_SUPPORT queries whether the device supports managed memory, allocated with cuMemAllocManaged.

Correct.

PI_USM_CROSS_SHARED_SUPPORT queries whether the device supports peer-to-peer access to managed memory.

Not entirely. P2P is kind of a special case. This is more about whether one shared allocation is able to migrate between different devices in a platform. P2P is really more about whether it can migrate directly without needing to go through the host.

I see.
Then I think I should slightly change the answers to the PI_USM_SINGLE_SHARED_SUPPORT and PI_USM_CROSS_SHARED_SUPPORT queries to reflect this.
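
Something along these lines, answering the queries from the device attributes the driver exposes (a sketch; the attribute choices are my assumptions based on the discussion above, not the final plugin code):

```cpp
#include <cuda.h>

// PI_USM_SINGLE_SHARED_SUPPORT: does the device support managed
// (shared) allocations at all?
bool single_shared_support(CUdevice dev) {
  int v = 0;
  cuDeviceGetAttribute(&v, CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY, dev);
  return v != 0;
}

// PI_USM_CROSS_SHARED_SUPPORT: assumption - concurrent managed access
// as a proxy for shared allocations being able to migrate freely.
bool cross_shared_support(CUdevice dev) {
  int v = 0;
  cuDeviceGetAttribute(&v, CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
  return v != 0;
}

// PI_USM_SYSTEM_SHARED_SUPPORT: can the device access pageable host
// memory from malloc/new (e.g. on Power9 systems)?
bool system_shared_support(CUdevice dev) {
  int v = 0;
  cuDeviceGetAttribute(&v, CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS, dev);
  return v != 0;
}
```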

@fwyzard fwyzard requested review from bader and jbrodman March 13, 2020 21:48
@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 59316da to da4eebb on March 13, 2020 21:58

bader commented Mar 14, 2020

@fwyzard the latest commit is not signed, and it looks like it has been squashed.
Could you also update the branch, please? I merged a conflicting PR: #1299.

Other than that I don't have any other comments to this patch. Thanks!

fwyzard added 3 commits March 14, 2020 07:33
Signed-off-by: Andrea Bocci <andrea.bocci@cern.ch>
Device
  - USM-related calls to piDeviceGetInfo

Kernel
  - piextKernelSetArgPointer

USM
  - piextUSMHostAlloc
  - piextUSMDeviceAlloc
  - piextUSMSharedAlloc
  - piextUSMFree
  - piextUSMEnqueueMemset
  - piextUSMEnqueueMemcpy
  - piextUSMEnqueuePrefetch (*)
  - piextUSMEnqueueMemAdvise (*)
  - piextUSMGetMemAllocInfo

(*) due to the incomplete documentation of the USM extension:
  - piextUSMEnqueuePrefetch ignores the "flags" argument;
  - piextUSMEnqueueMemAdvise does nothing.

Signed-off-by: Andrea Bocci <andrea.bocci@cern.ch>
Signed-off-by: Andrea Bocci <andrea.bocci@cern.ch>
@fwyzard fwyzard force-pushed the Implement_part_of_USM_for_CUDA branch from 7039ac0 to 8f3d479 on March 14, 2020 06:47

fwyzard commented Mar 14, 2020

Fixed the commit history and rebased on top of #1310.

@bjoernknafla

#1310 has been merged.

@fwyzard did you manage to get the LIT tests to work for CUDA? I am not sure (haven’t tested) if your SYCL_BE env var fix would be enough or if you need the more extensive fix of the get-device-count-by-type tool for it?

@bader bader merged commit 498d56c into intel:sycl Mar 14, 2020

fwyzard commented Mar 14, 2020

@fwyzard did you manage to get the LIT tests to work for CUDA? I am not sure (haven’t tested) if your SYCL_BE env var fix would be enough or if you need the more extensive fix of the get-device-count-by-type tool for it?

I'm actually testing with all these PRs on top of the sycl branch, but I haven't gotten around to trying the LIT tests yet.

@fwyzard fwyzard deleted the Implement_part_of_USM_for_CUDA branch March 14, 2020 16:12